Abstract


We consider the problem of learning a convex aggregation of models that is as good as the best convex aggregation,
for the binary classification problem. Working in the stream based active learning setting, where the active learner has
to decide on the fly whether to query for the label of the point currently seen in the stream, we propose
a stochastic mirror descent algorithm with entropy regularization, called SMD-AMA. We establish an excess risk
bound for the loss of the convex aggregate returned by SMD-AMA of the order of $O\left(\sqrt{\log(M)/T^{1-\mu}}\right)$, where
$\mu \in [0,1)$ is an algorithm dependent parameter that trades off the number of labels queried against the excess risk. We
demonstrate experimental results on standard UCI datasets, and show that SMD-AMA queries 10-68% of the points
to reach accuracy similar to that of a passive learner.


1 Introduction 


Machine learning techniques have become popular in many fields such as Astronomy, Physics, Biology, Chemistry,
Web Search, and Finance. With the availability of large amounts of data and greater computational power, newer machine
learning algorithms and applications have been discovered. A very popular subclass of machine learning problems
falls under the category of supervised learning. Classification and regression are the two most popular problems
in supervised learning, where the learner is provided with labeled data and is required to predict the labels of unseen
points. In classification problems these labels are discrete, whereas in regression problems they are continuous.

Supervised learning critically relies on the presence of labeled data. The cost of obtaining labels for different
data points depends on the problem domain. For example, in Astronomy it is easy to get access to large amounts
of unlabeled data, but obtaining labeled data is usually hard. In problems such as speech recognition and information
extraction, obtaining labeled data is often tedious and requires domain expertise [Xiaojin, 2005, Settles et al., 2008,
Settles, 2009]. In such cases, a natural question arises: can we learn with limited supervision?

In the classical passive binary classification problem, one has access to labeled samples
$S = \{(x_1, y_1), \ldots, (x_T, y_T)\}$ drawn from an unknown distribution $P$ defined on a domain
$\mathcal{X} \times \{-1, +1\}$, where $\mathcal{X} \subset \mathbb{R}^d$. The points $\{x_1, \ldots, x_T\}$
are sampled i.i.d. from the marginal distribution $P_X$, and the labels $y_1, \ldots, y_T$ are sampled from the conditional
distribution $P_{Y|X=x}$. Algorithms such as SVMs and logistic regression choose a hypothesis class $\mathcal{H}$ and an
appropriate loss function $L(\cdot)$, and solve an empirical risk minimization problem to return a hypothesis
$h \in \mathcal{H}$ whose risk, $R(h) = \mathbb{E}_{x,y \sim P} L(y h(x))$, is small. Given a collection of models
$B = \{b_1, \ldots, b_M\}$, the collection of convex aggregations of these models is a much richer class. Based on this idea,
many algorithms in machine learning and in approximation theory have been proposed which learn an aggregation of basic
models. Ensemble methods [Dietterich, 2000, Schapire, 2003] combine a large number of simple models to learn a single
powerful model. Boosting algorithms such as AdaBoost [Freund and Schapire, 1995] and LogitBoost [Friedman et al., 2000]
can be viewed as performing aggregation via functional gradient descent [Mason et al., 1999]. The key idea here is
minimization of a convex loss (exponential loss in the case of AdaBoost, logistic loss in the case of LogitBoost)
via a sequential aggregation of models in $B$. In the case of boosting, the base models are weak learners, and
by aggregating them we aim to boost the learning capabilities of the final aggregated model.

*Most of the work was done while the author was at Georgia Institute of Technology, Atlanta, GA, USA.

The problem of model aggregation for regression models was first proposed by Nemirovski [2000]. Given a collection of
models $B = \{b_1, \ldots, b_M\}$, Nemirovski [2000] outlined three problems of model aggregation, namely model selection,
convex aggregation, and linear aggregation. In this paper we are interested in actively learning a convex aggregation
of models for the binary classification problem. Let $b(x) = [b_1(x), \ldots, b_M(x)]^T \in \{-1, +1\}^M$, and let
$\langle \cdot, \cdot \rangle$ denote the Euclidean dot product. Given a convex, margin based loss function
$L : \mathbb{R} \to \mathbb{R}_+$, we want a procedure that outputs a model $f(x) = \sum_{j=1}^{M} \hat\theta_j b_j(x)$
in the convex hull of $B$, whose excess risk, when compared to the best model in the convex hull of $B$, satisfies the
inequality


$$\mathbb{E} L(y\langle\hat\theta, b(x)\rangle) \le \min_{\theta \in \Delta_M} \mathbb{E} L(y\langle\theta, b(x)\rangle) + \delta_{T,M},$$

where $\Delta_M$ is the probability simplex in $\mathbb{R}^M$, $\delta_{T,M} > 0$ is a small remainder term that goes to 0
as $T \to \infty$, and the expectation is w.r.t. all the random variables involved. In order to construct such a
$\hat\theta$ vector, we assume that we have access to an unlabeled stream of examples $x_1, x_2, \ldots$, drawn i.i.d. from
the underlying distribution $P_X$ defined on $\mathcal{X}$. This problem matches the one posed by Nemirovski, except that
instead of being given labeled data, we now have access to an unlabeled data stream and are allowed to query for the
labels which we believe are most informative.

Juditsky et al. [2005] studied the original version of the convex aggregation problem posed by Nemirovski, which
involved learning the best convex aggregation given access to a stream of labeled examples sampled
i.i.d. from the underlying distribution. They introduced an online, stochastic mirror descent algorithm for the problem
of learning the best convex aggregation of models. We shall call their method SMD-PMA$^1$. They showed that by
making one pass of the stochastic mirror descent algorithm, and by averaging the iterates obtained after each step of
the algorithm, the resulting convex aggregate has excess risk of $O\left(\sqrt{\log(M)/T}\right)$, where $M$ is the number
of models being aggregated and $T$ is the number of samples seen in the stream. Essentially, SMD-PMA is a slight
modification of the stochastic mirror descent algorithm, applied to the stochastic optimization problem

$$\min_{\theta \in \Delta_M} \mathbb{E} L(y\langle\theta, b(x)\rangle),$$

where $\Delta_M$ is the standard $(M-1)$-dimensional probability simplex. Since the constraint set is a simplex, the entropy
regularizer was used in SMD-PMA. The stochastic mirror descent procedure is followed by an averaging step, which
allows the authors to obtain excess risk bounds.

Contributions. In this paper, we are interested in learning a convex aggregation of models with the help of actively
labeled data. We consider a streaming setting, where we are given unlabeled points $x_1, x_2, \ldots$ and an oracle $O$. The
oracle $O$, when provided an input $x \in \mathcal{X}$, returns a label $y \in \{-1, +1\}$ drawn from $P[Y | X = x]$. We
present a one-pass, stochastic mirror descent based active learning algorithm, called SMD-AMA$^2$,
which solves the stochastic optimization problem $\min_{\theta \in \Delta_M} \mathbb{E}_{x,y}[L(y\langle\theta, b(x)\rangle)]$.
Since we are dealing with simplex constraints, we use the entropy function as the regularization function in our
stochastic mirror descent algorithm. In round $t$ of SMD-AMA, we query for the label of point $x_t$ with probability $p_t$.
This allows us to construct an unbiased stochastic subgradient of the objective function at the current iterate. If the
length of the stream is $T$ points, then SMD-AMA returns the hypothesis $\hat b_T(x) = \langle\hat\theta_T, b(x)\rangle$,
$\hat\theta_T \in \Delta_M$. We show that the excess risk of the hypothesis $\hat b_T$, w.r.t. the best convex aggregate of
models, scales with the number of models as $\sqrt{\log(M)}$, and decays with the number of points $T$ as
$1/\sqrt{T^{1-\mu}}$, where $\mu \in [0,1)$ is an algorithmic parameter. The mild dependence on the number of
models $M$ allows us to use a large number of models, which is desirable when learning a convex aggregation
of models.

The paper is organized as follows. After briefly reviewing mirror descent in Section 2, we introduce our algorithm
SMD-AMA in Section 3. In Section 4 we present an excess risk bound for the hypothesis returned by
our active learning algorithm. Section 5 reviews related work, and Section 6 compares our proposed algorithm to a
passive learning algorithm and a previously proposed ensemble based active learning algorithm.

1 SMD-PMA stands for Stochastic Mirror Descent based Passive Model Aggregation

2 SMD-AMA stands for Stochastic Mirror Descent based Active Model Aggregation


2 A brief introduction to mirror descent 


Mirror descent (MD) [Beck and Teboulle, 2003] is a first order optimization algorithm used to solve convex optimization
problems of the form $\min_{x \in C} f(x)$. MD assumes the presence of a (sub)gradient oracle which, when provided a
point $x \in C$, returns the subgradient $\nabla f(x)$. In each iteration of the mirror descent algorithm, given the current
iterate $x_t$, MD solves a minimization problem of the form $\min_{x \in C} \langle\nabla f(x_t), x\rangle + D_{\mathcal{R}}(x, x_t)$,
where $\nabla f(x_t)$ is the subgradient of $f$ at $x_t$ and $D_{\mathcal{R}}(\cdot, \cdot)$ is the Bregman divergence
corresponding to a strictly convex function $\mathcal{R}$, which we shall call the regularization function. The power of the
MD algorithm stems from the fact that, by an appropriate choice of the regularization (distance generating) function,
one can adapt to the geometry of the constraint set $C$. The classical mirror descent algorithm has also been extended to
stochastic optimization problems of the form $\min_{x \in C} \{f(x) = \mathbb{E}_w F(x; w)\}$, to
yield the stochastic mirror descent (SMD) algorithm [Nemirovski et al., 2009, Lan et al., 2011]. In SMD, we assume
that an oracle provides us with an unbiased estimate of the gradient of the stochastic objective.
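As a rough illustration of the mirror descent step described above, when the constraint set is the probability simplex and the regularizer is the entropy function, the minimization has a closed form: a multiplicative update followed by normalization. The sketch below is only a toy under stated assumptions; the linear objective, the step size `eta`, and the function names are hypothetical and do not come from the paper.

```python
import math

def entropic_md_step(x, grad, eta):
    """One mirror descent step on the simplex with the entropy regularizer.

    Solves argmin_u <grad, u> + (1/eta) * KL(u || x) over the simplex, whose
    closed form is u_j proportional to x_j * exp(-eta * grad_j).
    """
    # Shift by the smallest gradient entry for numerical stability; the shift
    # cancels when the weights are normalized.
    g_min = min(grad)
    w = [xj * math.exp(-eta * (gj - g_min)) for xj, gj in zip(x, grad)]
    total = sum(w)
    return [wj / total for wj in w]

def minimize_linear(c, steps=200, eta=0.5):
    """Toy problem (assumed for illustration): minimize <c, x> over the simplex."""
    x = [1.0 / len(c)] * len(c)
    for _ in range(steps):
        x = entropic_md_step(x, c, eta)  # the gradient of <c, x> is c
    return x

x_final = minimize_linear([0.9, 0.2, 0.5])
```

On this toy problem the iterates concentrate on the coordinate with the smallest cost, which is the vertex of the simplex that minimizes the linear objective.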


3 Algorithm Design 


Standard analysis of the stochastic mirror descent algorithm assumes that we have access to a stochastic (sub)gradient
oracle which provides an unbiased estimate of the gradient of the objective function at any point in the domain. The
stochastic optimization problem that we are trying to solve is

$$\min_{\theta \in \Delta_M} \{ f(\theta) = \mathbb{E} L(y\langle\theta, b(x)\rangle) \}. \tag{1}$$

A naive application of the stochastic mirror descent method would require, in each iteration, a stochastic
subgradient of $\mathbb{E} L(y\langle\theta, b(x)\rangle)$. In iteration $t$, if the current iterate is $\theta_{t-1}$, then a
stochastic subgradient of $f(\theta)$ at $\theta = \theta_{t-1}$ is given by

$$\nabla f(\theta_{t-1}) = L'(y_t\langle\theta_{t-1}, b(x_t)\rangle)\, y_t\, b(x_t), \tag{2}$$

where $L'(\cdot)$ is the subderivative of $L$ at the given argument. If in round $t$ we decide to query for the label of the
point $x_t$, then we can calculate the stochastic subgradient using Equation 2. However, if we decide not to query for
the label of $x_t$, then the stochastic subgradient, which depends on the unknown label $y_t$ of $x_t$, cannot be calculated.
While one could, in such a case, take the stochastic subgradient to be the zero vector, this is no longer an unbiased
estimate of the subgradient. This is problematic, as the classical analysis of stochastic mirror descent assumes
access to unbiased estimates of the subgradient of the objective function. To counter this problem we
use the idea of importance weighting.

Importance weighted subgradient estimates: In order to use importance weights, in round $t$, upon seeing a point
$x_t$, we query $x_t$ with probability $p_t$. Suppose $Q_t$ is a $\{0,1\}$ random variable which takes the value 1 if $x_t$ was
queried, and the value 0 if $x_t$ was not queried. Let $Z_t = y_t Q_t$. We shall also make the following independence
assumption.

Assumption 1. $p_t \perp y_t \mid x_t, Z_{1:t-1}$,

where $Z_{1:t-1}$ is the collection of random variables $Z_1, \ldots, Z_{t-1}$. Consider the following importance-weighted
stochastic subgradient: $g_t = \frac{Q_t}{p_t} L'(y_t\langle\theta_{t-1}, b(x_t)\rangle)\, y_t\, b(x_t)$. Since
$\mathbb{E}[Q_t / p_t] = 1$, it is easy to see that, under Assumption 1, given $\theta_{t-1}$, $g_t$
is an unbiased estimator of $\nabla_\theta f(\theta_{t-1})$. Hence, the above importance weighted stochastic subgradient
allows us to get unbiased estimates of the gradient of the objective function in Equation 1. Importance weighting was
first introduced for stream based active learning problems by Beygelzimer et al. [2009]. The use of importance weights
helps us construct unbiased estimators of the loss of any hypothesis in a hypothesis class. This is particularly useful when
the hypothesis class used for obtaining the actively labeled dataset goes out of favour.
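A small simulation can make the unbiasedness argument concrete. The sketch below is illustrative only: it replaces one coordinate of the subgradient by a fixed scalar `value`, fixes the query probability `p`, and compares the importance-weighted estimate with the zero-fill alternative mentioned earlier. The function name and the numbers are assumptions, not part of the paper.

```python
import random

def estimates(value, p, n, seed=0):
    """Average n draws of (Q / p) * value and of Q * value, with Q ~ Bernoulli(p)."""
    rng = random.Random(seed)
    weighted_sum = 0.0
    zero_fill_sum = 0.0
    for _ in range(n):
        queried = rng.random() < p        # Q_t = 1 with probability p_t
        if queried:
            weighted_sum += value / p     # importance-weighted term (Q_t / p_t) * value
            zero_fill_sum += value        # unweighted term; contributes 0 when not queried
    return weighted_sum / n, zero_fill_sum / n

weighted, zero_fill = estimates(value=-0.7, p=0.25, n=200_000)
```

The importance-weighted average concentrates around `value`, while the zero-fill average concentrates around `p * value`, which is the bias that importance weighting removes.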

Before we dive into the details of SMD-AMA, we need some notation and terminology that is relevant for
the explanation of our algorithm. Most of this exposition is standard and is taken from Juditsky et al. [2005]. Let
$E = \ell_1^M$ be the space $\mathbb{R}^M$ equipped with the $\ell_1$ norm. Let $E^* = \ell_\infty^M$ be the corresponding
dual space, equipped with the $\ell_\infty$ norm.

Definition 1. Let $\Delta_M \subset E$ be the simplex, and let $V : \Delta_M \to \mathbb{R}$ be a convex function. For a given
parameter $\beta$, the $\beta$-Fenchel dual of $V$ is the convex function $V^*_\beta : E^* \to \mathbb{R}$, defined as

$$V^*_\beta(\xi) = \sup_{\theta \in \Delta_M} \left[ -\langle\xi, \theta\rangle - \beta V(\theta) \right].$$

Finally, as mentioned in Section 2, to define the SMD algorithm we need a regularization function. Since our
constraint set is a simplex, we shall use the entropy function, defined as

$$\mathcal{R}(\theta) = \begin{cases} \sum_{i=1}^{M} \theta_i \log(\theta_i) & \text{if } \theta \in \Delta_M \\ \infty & \text{otherwise,} \end{cases}$$

as our regularization function.

Design of SMD-AMA. We now have all the ingredients of our algorithm in place. The algorithm proceeds in
a streaming fashion, looking at one unlabeled data point at a time. Step 4 of SMD-AMA calculates the probability
of the point being labeled +1 by the current convex aggregate, and this calculation is used in Step 5 to calculate the
probability of querying the label of $x_t$. Notice that the probability of querying a point in round $t$ is always at least
$\epsilon_t > 0$. The value of $\epsilon_t$ is set in Step 3. Step 7 calculates the importance weighted gradient, which is used
in Steps 8 and 10 to calculate the new iterate $\theta_t$. By straightforward calculus, one can show that Step 10 leads to the
following iteration:

$$\theta_{t,j} \propto \exp(-\xi_{t,j} / \beta_t),$$

where $\xi_{t,j}$ is the $j$th component of the vector $\xi_t$. In Step 4, we use properties of the loss function in order to
estimate $p_t^+$. It is well known that standard loss functions used in classification, such as the exponential and squared
losses, are also proper losses for probability estimation [Zhang, 2004, Buja et al., 2005, Reid and Williamson, 2009]. Hence,
given the loss function, it is easy to estimate the conditional probability $P[Y_t = 1 | X_t = x_t]$ via standard formulae.
For instance, if one were to use the squared loss $L(yz) = (1 - yz)^2$, then our estimate for $p_t^+$ is given by the
formula $\max\left(0, \min\left(\frac{1+z}{2}, 1\right)\right)$. In our case, the value of $z$ in round $t$ is given by
$\langle\theta_{t-1}, b(x_t)\rangle$, and hence
$p_t^+ = \max\left(0, \min\left(\frac{1 + \langle\theta_{t-1}, b(x_t)\rangle}{2}, 1\right)\right)$.
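The probability estimate for the squared loss and the query probability of Step 5 can be sketched as two small functions. This is a minimal sketch under one loudly stated assumption: the schedule `eps_t = t ** (-mu)` is not legible in the source and is assumed here because it is consistent with the lower bound on the number of queries discussed in Section 6.4. The function names are hypothetical.

```python
def p_plus_squared_loss(z):
    """Estimate P[Y = +1 | x] from the aggregate output z under squared loss."""
    return max(0.0, min((1.0 + z) / 2.0, 1.0))

def query_probability(z, t, mu):
    """p_t = 4 p+ (1 - p+) (1 - eps_t) + eps_t, with eps_t = t ** (-mu) (assumed)."""
    eps_t = t ** (-mu)
    pp = p_plus_squared_loss(z)
    return 4.0 * pp * (1.0 - pp) * (1.0 - eps_t) + eps_t
```

With this form the query probability is largest when the aggregate is maximally uncertain (`z = 0`) and falls to the floor `eps_t` when the aggregate is confident (`z = 1` or `z = -1`).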


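Algorithm 1 SMD-AMA (Input: a margin based loss function $L$, labeling oracle $O$, parameters $\mu \in [0,1)$, $\beta_0 > 0$)

1. Initialize $\theta_0 = [\frac{1}{M}, \ldots, \frac{1}{M}]^T$, $\xi_0 = [0, \ldots, 0]$, $t = 1$.

for $t = 1, \ldots$ do

2. Receive $x_t$.

3. Set $\epsilon_t = t^{-\mu}$.

4. Estimate $p_t^+ = P[Y_t = 1 | X = x_t, \theta_{t-1}]$.

5. Calculate $p_t = 4 p_t^+ (1 - p_t^+)(1 - \epsilon_t) + \epsilon_t$.

6. Query the label of $x_t$ with probability $p_t$.

7. Set $g_t = \frac{Q_t}{p_t} L'(y_t\langle\theta_{t-1}, b(x_t)\rangle)\, y_t\, b(x_t)$.

8. Set $\xi_t = \xi_{t-1} + g_t$.

9. Update $\beta_t = \beta_0 \sqrt{(t+1)^{1+\mu}}$.

10. Calculate $\theta_t = -\nabla\mathcal{R}^*_{\beta_t}(\xi_t)$.

11. $t \leftarrow t + 1$.

end for

12. Return $\hat\theta_T = \frac{1}{T}\sum_{t=1}^{T} \theta_{t-1}$.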


A technical point is that Step 9 in Algorithm 1 is performed irrespective of whether the point received in round
$t$ is queried or not. This allows the evolution of $\beta_t$ to be deterministic, and hence allows us to use a certain lemma of
Juditsky et al. [2005] that is crucial to our excess risk analysis. As a result of this step, our current convex aggregate
$\theta_t$ will be different from $\theta_{t-1}$, even if the point $x_t$ was not queried.
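The steps of the algorithm can be put together in a short, self-contained sketch. It is not the authors' implementation: it assumes the squared loss, the schedules `eps_t = t ** (-mu)` and `beta_t = beta0 * (t + 1) ** ((1 + mu) / 2)`, and an averaged final iterate, all of which are reconstructions from the surrounding text. The toy stream, the three stumps, and all function names are hypothetical.

```python
import math
import random

def entropic_weights(xi, beta):
    """theta_j proportional to exp(-xi_j / beta): the entropic mirror step."""
    m = min(xi)  # shift for numerical stability; cancels on normalization
    w = [math.exp(-(v - m) / beta) for v in xi]
    s = sum(w)
    return [v / s for v in w]

def smd_ama(stream, oracle, models, mu=0.3, beta0=1.0, seed=0):
    """Sketch of SMD-AMA with squared loss L(z) = (1 - z)^2.

    Returns (average of the iterates theta_0 .. theta_{T-1}, number of queries).
    """
    rng = random.Random(seed)
    M = len(models)
    theta = [1.0 / M] * M
    xi = [0.0] * M
    theta_sum = [0.0] * M
    queries, T = 0, 0
    for t, x in enumerate(stream, start=1):
        T = t
        theta_sum = [s + th for s, th in zip(theta_sum, theta)]  # adds theta_{t-1}
        b = [model(x) for model in models]
        z = sum(th * bj for th, bj in zip(theta, b))
        eps = t ** (-mu)                                   # assumed schedule
        p_plus = max(0.0, min((1.0 + z) / 2.0, 1.0))
        p = 4.0 * p_plus * (1.0 - p_plus) * (1.0 - eps) + eps
        if rng.random() < p:                               # query with probability p_t
            y = oracle(x)
            queries += 1
            scale = -2.0 * (1.0 - y * z) * y / p           # (1 / p_t) * L'(y z) * y
            xi = [v + scale * bj for v, bj in zip(xi, b)]
        # beta and theta are updated in every round, queried or not.
        beta = beta0 * (t + 1) ** ((1.0 + mu) / 2.0)       # assumed schedule
        theta = entropic_weights(xi, beta)
    if T == 0:
        return theta, 0
    return [s / T for s in theta_sum], queries

# Toy usage: 1-D points in [-1, 1], labels sgn(x), three stumps and their negations.
thresholds = [-0.5, 0.0, 0.5]
stumps = [lambda x, th=th: 1 if x > th else -1 for th in thresholds]
models = stumps + [lambda x, f=f: -f(x) for f in stumps]
data_rng = random.Random(1)
stream = [data_rng.uniform(-1.0, 1.0) for _ in range(2000)]
weights, n_queries = smd_ama(stream, lambda x: 1 if x > 0.0 else -1, models)
```

In this toy setting the stump with threshold 0 reproduces the labels exactly, so its accumulated gradient is never larger than that of any other model and it receives the largest averaged weight, while only part of the stream is queried.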
4 Excess Risk analysis


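Theorem 1. Let $B = \{b_1, \ldots, b_M\}$ be a collection of basis models, where for each $x \in \mathcal{X}$,
$b_j(x) \in \{-1, +1\}$. For any $x$ in $\mathcal{X}$, let $b(x) = [b_1(x), \ldots, b_M(x)]^T$. Let
$R(\theta) = \mathbb{E} L(y\langle\theta, b(x)\rangle)$, and let $L_\phi$ be an upper bound on $|L'(y\langle\theta, b(x)\rangle)|$.
Then, for any $T \ge 1$ and any $\theta \in \Delta_M$, when SMD-AMA is run with the parameter
$\beta_0 = \sqrt{\frac{L_\phi^2}{\log(M)\sqrt{2^{1+\mu}}\,(1+\mu)}}$, $\mu \in [0,1)$, the convex aggregation
$\hat\theta_T$ returned by the SMD-AMA algorithm after $T$ rounds satisfies the following excess risk inequality:

$$\mathbb{E} R(\hat\theta_T) \le R(\theta) + 2\sqrt{T^{\mu-1}} \sqrt{\frac{L_\phi^2 \sqrt{2^{1+\mu}} \log(M)}{1+\mu}}.$$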


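4.1 Proof of Theorem 1

Let $R(\theta) = \mathbb{E} L(y\langle\theta, b(x)\rangle)$. By definition,

$$\hat\theta_T = \frac{\sum_{t=1}^{T} \theta_{t-1}}{T}.$$

Since $L$ is a convex function, by Jensen's inequality,

$$\mathbb{E} R(\hat\theta_T) - \mathbb{E} R(\theta) \le \frac{\sum_{t=1}^{T} \left(\mathbb{E} R(\theta_{t-1}) - \mathbb{E} R(\theta)\right)}{T}. \tag{3}$$

Since $R(\theta)$ is a convex function of $\theta$, we can use the subgradient to build an under-approximation and get

$$R(\theta_{t-1}) - R(\theta) \le \langle\nabla R(\theta_{t-1}), \theta_{t-1} - \theta\rangle. \tag{4}$$

Putting together Equations 3 and 4, we then get

$$\begin{aligned}
\mathbb{E} R(\hat\theta_T) - \mathbb{E} R(\theta)
&\le \frac{\sum_{t=1}^{T} \mathbb{E}\langle\nabla R(\theta_{t-1}), \theta_{t-1} - \theta\rangle}{T} \\
&\le \frac{1}{T}\left[\beta_T \log(M) + \mathbb{E}\sum_{t=1}^{T} \frac{\|g_t\|_\infty^2}{2\beta_{t-1}}
 - \mathbb{E}\sum_{t=1}^{T} \langle\theta_{t-1} - \theta, g_t - \nabla R(\theta_{t-1})\rangle\right].
\end{aligned}$$

In the above set of inequalities, to obtain the second inequality from the first we used Lemma 1 (presented in
Appendix A), together with the fact that the entropy regularizer satisfies $\mathcal{R}(\theta) \le \log(M)$ on $\Delta_M$.
Since our data stream is drawn i.i.d. from the input distribution, $\mathbb{E}[g_t - \nabla R(\theta_{t-1})] = 0$.
This gets us

$$\begin{aligned}
\mathbb{E} R(\hat\theta_T) - \mathbb{E} R(\theta)
&\le \frac{\sum_{t=1}^{T} \mathbb{E}\langle\nabla R(\theta_{t-1}), \theta_{t-1} - \theta\rangle}{T} \\
&\overset{(a)}{\le} \frac{1}{T}\left[\beta_T \log(M) + \sum_{t=1}^{T} \mathbb{E}\,\frac{L_\phi^2}{2 p_t \beta_{t-1}}\right] \\
&\overset{(b)}{\le} \frac{\beta_T \log(M)}{T} + \frac{L_\phi^2}{2\beta_0 T}\sum_{t=1}^{T} t^{\frac{\mu-1}{2}} \\
&\le 2\sqrt{T^{\mu-1}} \sqrt{\frac{L_\phi^2 \sqrt{2^{1+\mu}} \log(M)}{1+\mu}}.
\end{aligned}$$

Inequality (a) was obtained by using the fact that $\mathbb{E}[Q_t / p_t] = 1$, and by bounding
$|L'(y_t\langle\theta_{t-1}, b(x_t)\rangle)| \le L_\phi$. Inequality (b) was obtained by substituting for $\beta_{t-1}$ the
expression $\beta_0\sqrt{t^{1+\mu}}$ and upper bounding $\frac{1}{p_t}$ with $\frac{1}{\epsilon_t} = t^{\mu}$. Replacing
$\beta_T$ in terms of $\beta_0$ and $T$, using the expression for $\beta_0$ given in the statement of the theorem, and
over-approximating $T + 1$ by $2T$, we get the desired result.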
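The last step of the argument can be spelled out as a short worked calculation. The sketch assumes $\epsilon_t = t^{-\mu}$ and $\beta_t = \beta_0 (t+1)^{(1+\mu)/2}$, in line with the substitutions described above; the shorthand $a$ and $c$ is introduced here only for illustration and is not notation from the paper.

```latex
\begin{align*}
\sum_{t=1}^{T} t^{\frac{\mu-1}{2}}
  &\le \int_0^T s^{\frac{\mu-1}{2}}\,ds = \frac{2\,T^{\frac{1+\mu}{2}}}{1+\mu},
  \qquad
  \beta_T = \beta_0 (T+1)^{\frac{1+\mu}{2}} \le \beta_0 (2T)^{\frac{1+\mu}{2}}, \\
\frac{\beta_T \log(M)}{T} + \frac{L_\phi^2}{2\beta_0 T}\sum_{t=1}^{T} t^{\frac{\mu-1}{2}}
  &\le T^{\frac{\mu-1}{2}}\left(\beta_0\,a + \frac{c}{\beta_0}\right),
  \qquad a = \sqrt{2^{1+\mu}}\,\log(M), \quad c = \frac{L_\phi^2}{1+\mu}, \\
\min_{\beta_0 > 0}\left(\beta_0\,a + \frac{c}{\beta_0}\right)
  &= 2\sqrt{a c}
  \quad \text{attained at} \quad \beta_0 = \sqrt{c / a}.
\end{align*}
```

Under these assumptions the minimizing value of $\beta_0$ balances the regularization term against the variance term, and $2\sqrt{ac}$ gives the constant multiplying $\sqrt{T^{\mu-1}}$ in the bound.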
5 Related Work


Madani et al. [2004] considered the problem of active model aggregation, where given many models one has to
choose a single best model from the collection. They model this problem as the coins problem, where a player is
provided with a certain number of flips and is allowed to flip coins until the budget runs out, after which the player
has to report the coin which has the highest probability of turning up heads. A reinforcement learning approach to
the same problem was taken by Kapoor and Greiner [2005]. In this paper, we are dealing with the more complicated
problem of choosing the best model from the convex hull of a given set of models. Mamitsuka [1998] combines the
ideas of query-by-committee and boosting to come up with an active learning algorithm, where the current weighted
majority is used to decide which point to query next. In order to obtain a weighted majority of hypotheses, the authors
suggest using ensemble techniques such as boosting and bagging. Trapeznikov et al. [2010] introduced the ActBoost
algorithm, which is an active learning algorithm in the boosting framework. In particular, ActBoost works under the
weak learning assumption of Freund et al. [1996], which assumes that there is a hypothesis with zero error rate in the
convex hull of the base classifiers. Under this assumption, the authors suggest a version space based algorithm that
maintains all the possible convex combinations of the base hypotheses that are consistent with the current data, and
queries the labels of the points on which two hypotheses in the current convex hull disagree. By design, the ActBoost
algorithm is very brittle. In contrast, we do not make any weak learning assumptions, and hence avoid the problems
that ActBoost might face when the weak learning assumption is not satisfied. Active learning algorithms in the boosting
framework have also been suggested by Iyengar et al. [2000], but they do not admit any guarantees and are somewhat
heuristic.


6 Experimental Results 


We implemented SMD-AMA, along with SMD-PMA and the QBB algorithm [Mamitsuka, 1998]. As mentioned
before, QBB is an ensemble based active learning algorithm which builds a committee via the AdaBoost algorithm.
QBB works in an iterative fashion. In round $t$, QBB runs the AdaBoost algorithm on the currently labeled dataset $S_t$,
with the collection of models $B$, to get a boosted model $h_t$. A boosted model is of the form
$h_t(x) = \sum_{j=1}^{M} \alpha_{j,t} b_j(x)$, where $\alpha_{j,t} \ge 0$ for all $j = 1, \ldots, M$. To choose the next point to
be queried, QBB generates a random sample of $R$ points from the current set of unqueried points. Suppose this random
sample is $C_t$. We look for the point in $C_t$ whose margin w.r.t. $h_t$ is the smallest, and query for the label of this point.
This process is repeated until some condition is satisfied (typically until a budget is exhausted).
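The selection step of QBB described above can be sketched as follows. This is a hedged illustration rather than the original QBB code: it assumes that the margin of an unlabeled point is the absolute value of the boosted model's output, and the names `select_query`, `margin_fn`, and `sample_size` are hypothetical.

```python
import random

def select_query(unqueried, margin_fn, sample_size, rng):
    """Pick the point with the smallest absolute margin within a random sample.

    unqueried: list of candidate points; margin_fn(x) returns h_t(x), the
    real-valued output of the boosted model; sample_size plays the role of R.
    """
    k = min(sample_size, len(unqueried))
    sample = rng.sample(unqueried, k)          # the random subset C_t
    return min(sample, key=lambda x: abs(margin_fn(x)))

# Toy usage with a hypothetical boosted model h_t(x) = x on the real line.
points = [-2.0, -0.1, 0.5, 3.0]
chosen = select_query(points, lambda x: x, 10, random.Random(0))
```

When the sample covers the whole pool, as in the toy usage, the selected point is simply the one on which the boosted model is least confident.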


6.1 Experimental Setup. 

We used decision stumps along different dimensions, and with different thresholds, to form our set of basis models $B$.
A decision stump is a weak classifier that is characterized by a dimension $j$ and a threshold $\theta$, and classifies a point
$x \in \mathbb{R}^d$ as $\mathrm{sgn}(x_j - \theta)$, where $x_j$ is the $j$th dimension of $x$. For all our experiments, we
used 80 decision stumps along each dimension$^3$. We make our set $B$ symmetric by adding $-b$ to $B$ if $b \in B$. Unless
otherwise mentioned, $\mu$ is set to 0.30 for all of our SMD-AMA experiments. We report results on some standard UCI
datasets$^4$. The loss function used for SMD-AMA and SMD-PMA is the squared loss.
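A symmetric set of decision stumps of the kind described above could be built along the following lines. The sketch makes two assumptions that the text does not specify: thresholds are spaced evenly between the minimum and maximum of each feature, and a point lying exactly on a threshold is assigned to the negative side. The function names are hypothetical.

```python
def make_stump(dim, threshold, sign=1):
    """Decision stump sign * sgn(x[dim] - threshold), with values in {-1, +1}."""
    def stump(x):
        return sign if x[dim] > threshold else -sign
    return stump

def build_basis(data, stumps_per_dim):
    """Symmetric basis: every stump b is accompanied by its negation -b."""
    d = len(data[0])
    basis = []
    for j in range(d):
        lo = min(row[j] for row in data)
        hi = max(row[j] for row in data)
        for k in range(1, stumps_per_dim + 1):
            # Evenly spaced thresholds inside (lo, hi): an assumption, since the
            # text does not say how the thresholds were chosen.
            thr = lo + (hi - lo) * k / (stumps_per_dim + 1)
            basis.append(make_stump(j, thr, +1))
            basis.append(make_stump(j, thr, -1))
    return basis

toy_data = [[0.0, 1.0], [1.0, 3.0]]
toy_basis = build_basis(toy_data, stumps_per_dim=3)
```

With 80 stumps per dimension, as in the experiments, this construction would yield 160 models per feature once the negations are included.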


6.2 Comparison with Passive Learning 

Our first set of experiments compares SMD-PMA to SMD-AMA. We run both SMD-AMA and SMD-PMA on our
datasets, and use the hypothesis output by these algorithms at the end of each round to classify a test dataset.
In Figure 1 we plot the test error rate of both algorithms against the number of points seen in the stream. Note that
while SMD-PMA gets to see the label of each and every point, SMD-AMA gets to see only those labels which it
queries. Alternatively, one could plot the test error of the algorithms w.r.t. the number of labels seen. However,
as mentioned in Section 3, our hypothesis can change between two consecutive queries, and hence it is easier to plot
the test error against the number of points seen rather than the number of labels seen. We used the squared loss function
for training the SMD-AMA and SMD-PMA algorithms. Finally, since SMD-AMA is a randomized algorithm, we report
results averaged over 10 iterations. From Table 1 it is clear that for all datasets but Whitewine, both SMD-AMA and
SMD-PMA attain almost the same error rate after finishing a single pass through the stream. Figure 1 shows how the
test error changes for SMD-PMA and SMD-AMA with the number of points seen in the stream. While in the case of
Abalone and Statlog, SMD-AMA quickly catches up with SMD-PMA (and in fact slightly surpasses it), in
the case of MNIST the difference between SMD-AMA and SMD-PMA closes only after having seen about 80% of
the stream. In the case of Whitewine, SMD-PMA is uniformly better than our active learning algorithm, SMD-AMA.
The difference in error rates between SMD-PMA and SMD-AMA at the end of the stream is about 1.32%. In the
case of the Magic and Redwine datasets, the difference in performance of SMD-AMA and SMD-PMA is negligible.

3 Using more decision stumps yielded insignificant improvement in test error, but increased computational complexity by a large amount.
4 Our datasets are Abalone, Statlog, MNIST (3 vs 5), Whitewine, Magic, Redwine.

(e) Magic (f) Redwine

Figure 1: Comparison of the test error between SMD-AMA and SMD-PMA with the number of points seen.


Dataset     Active (SMD-AMA)   Passive (SMD-PMA)   Queries   Fraction
Abalone     0.2889             0.2922              440.2     0.1317
Statlog     0.0491             0.0520              2984      0.6728
MNIST       0.1496             0.1442              931.6     0.0932
Whitewine   0.3075             0.2864              406.8     0.1351
Magic       0.2166             0.2171              2450      0.2146
Redwine     0.2933             0.2911              540       0.5625

Table 1: Comparison of the test error between SMD-AMA (second column in the table) and SMD-PMA (third column
in the table), and number of queries made by SMD-AMA on different datasets. The penultimate column reports the
number of queries made by the algorithm, and the last column represents the fraction of the training dataset whose
labels were queried. All results are for the hypothesis returned by the algorithms at the end of the stream.

The number of queries made in Abalone, MNIST, and Whitewine is less than 14% of the length of the stream,
which implies that we do as well as passive learning for Abalone and MNIST at the expense of far fewer labels.
In the case of the Statlog and Redwine datasets the number of queries made is comparatively larger, about 67.28% and
56.25% of the size of the dataset respectively. On Magic the fraction of queries made is less than 25% of the number
of training points in the dataset.


Dataset     SMD-AMA   SMD-PMA
Abalone     0.7992    0.7611
Statlog     0.3242    0.3396
MNIST       0.5517    0.5381
Whitewine   0.7894    0.7361
Magic       0.6315    0.6305
Redwine     0.7455    0.7660

Table 2: Comparison of the test loss between SMD-AMA and SMD-PMA for the hypothesis returned by the algorithms
at the end of the stream.


(a) Statlog (b) Magic (c) Redwine (d) Abalone

Figure 2: Comparison of the test loss of SMD-AMA and SMD-PMA with the number of points seen.

In Table 2 we report the loss on the test data of SMD-AMA and SMD-PMA at the end of the data stream. Figure 2
reports the test loss of both algorithms against the number of samples seen in the stream. Note that while test error is
always between 0 and 1, the test loss can be larger than 1. In fact, for the convex aggregation model that we consider in
this paper, and with the squared loss, the maximum loss can be as large as $\max_{z \in [-1,1]} (1 - z)^2 = 4$. The purpose of
these experiments was to examine how the difference between the test loss suffered by SMD-AMA and SMD-PMA
changes with the number of points seen in the stream. In the case of Statlog and MNIST the difference in losses is
generally smaller than in the case of Abalone and Whitewine, and on the Magic and Redwine datasets SMD-AMA generally
suffers a slightly smaller loss than SMD-PMA. However, since the scale for test loss is larger than 1, these results
seem to imply that both SMD-AMA and SMD-PMA have comparable rates of decay for the test loss with the number
of unlabeled examples seen. Encouraged by these results, we believe it is possible to derive sharper excess risk bounds
for SMD-AMA.


6.3 Number of Queries Vs Number of Points Seen 

Figure 3 shows how the number of queries made by SMD-AMA scales with the number of points seen in the stream
on our datasets. This scaling is almost linear in the case of Statlog. This was expected, given that on
Statlog we query the labels of about 67% of the points in the stream. A similar result holds true for the Redwine
dataset. In contrast, on the Abalone, MNIST, Whitewine, and Magic datasets the scaling seems to be sublinear.

Figure 3: The number of queries made by SMD-AMA as a function of the number of data points seen.


Dataset     Budget   Error rate of SMD-AMA   Error rate of QBB
Abalone     441      0.2889                  0.3293
Statlog     2984     0.0491                  0.0407
MNIST       932      0.1496                  0.1756
Whitewine   398      0.3075                  0.3543
Redwine     540      0.2933                  0.3146
Magic       2450     0.2166                  0.2499

Table 3: Comparison of the error rate of QBB and SMD-AMA for a given budget.

6.4 Effect of Parameter $\mu$


(a) Test Error (b) Queries made

Figure 4: The plots show, on the MNIST dataset, as a function of the parameter $\mu$, the test error of SMD-AMA at the
end of the stream and the number of queries made.


From Steps 3 and 5 of SMD-AMA, it is clear that $p_t \ge \epsilon_t$. Hence,

$$\mathbb{E}[\text{Number of Queries}] \ge \sum_{t=1}^{T} \epsilon_t =
\begin{cases} \Theta(T^{1-\mu}) & \text{if } \mu < 1 \\ \Theta(\log(T)) & \text{if } \mu = 1. \end{cases}$$


Theorem 1 says that the excess risk of SMD-AMA has an upper bound that scales as $O(\sqrt{T^{\mu-1}})$. Hence, from this
discussion we know that a large $\mu$ leads to a smaller lower bound on the expected number of queries made, but a larger
upper bound on the excess risk. Figure 4 demonstrates, for the MNIST dataset, the trade-off between test error and the
number of label queries made by SMD-AMA as the parameter $\mu$ changes. As we can see from these plots, larger $\mu$ leads
to a smaller number of queries but a larger test error. Similar results have been observed on additional datasets.
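The lower bound on the expected number of queries is easy to tabulate for different values of the parameter. The sketch below assumes the schedule `eps_t = t ** (-mu)`, which is consistent with the growth rates stated above but is not legible in the algorithm listing; the function name and the chosen values of `mu` are illustrative.

```python
def query_lower_bound(T, mu):
    """Sum of eps_t = t ** (-mu) for t = 1..T, a lower bound on E[number of queries]."""
    return sum(t ** (-mu) for t in range(1, T + 1))

# Lower bounds for a stream of 10,000 points and a few values of mu.
bounds = {mu: query_lower_bound(10_000, mu) for mu in (0.0, 0.3, 0.6, 0.9)}
```

With `mu = 0` the floor on the query probability is 1 and every point is queried, which recovers passive learning; larger values of `mu` shrink the floor and therefore the guaranteed number of queries.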
6.5 Comparison with QBB

In contrast to SMD-PMA and SMD-AMA, QBB is a pool based active learning algorithm, not a stream based one.
Hence, QBB has access to the entire set of unlabeled data points and, in each round, can
choose one data point to query. In order to provide a fair comparison of QBB and SMD-AMA, we used the number
of queries made by SMD-AMA in our first set of experiments as a budget parameter for the QBB algorithm, and
report the error rate of the hypotheses returned by SMD-AMA and QBB at the end of the budget. It is clear from
Table 3 that, for all the datasets except Statlog, QBB has a higher error rate than SMD-AMA, by roughly 2 to 5
percentage points. For the Statlog dataset, the test errors of SMD-AMA and QBB are comparable, possibly because the
budget in our QBB experiments is large.


7 Conclusions and Discussions 


In this paper we considered active learning of convex aggregation of classification models. We presented a stochastic
mirror descent algorithm which uses importance weighting to obtain unbiased stochastic gradients.
We established excess risk guarantees for the resulting convex aggregation and experimentally demonstrated the
good performance of our algorithm. This work can be extended in many directions, some of which are stated below.


Systematic study of the test error versus number of queries tradeoff. As shown in Section 6.4, the parameter $\mu$ captures
the tradeoff between the test error and the number of queries made. A systematic study of how the test error and the
number of queries made by SMD-AMA change with the parameter $\mu$ could help a user tune the value of $\mu$ in
order to be on the “right point” of the test error versus number of queries curve. A possible direction towards this goal
is to establish sharper excess risk guarantees for SMD-AMA, and an upper bound on the number of queries made by
SMD-AMA, as a function of $\mu$ and problem parameters.


A Excess Risk Bounds 

In our paper we established an excess risk bound for the convex aggregate learned by our active learning algorithm. In
order to establish our theorem, we need the following key lemma. This result was first established in Juditsky et al.
[2005] (see Proposition 2 in Juditsky et al. [2005]). We state the result here only for convenience.


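Lemma 1. For any $\theta \in \Delta_M$, and for any $T \ge 1$, we have

$$\sum_{t=1}^{T} \langle\theta_{t-1} - \theta, \nabla R(\theta_{t-1})\rangle \le \beta_T \mathcal{R}(\theta)
- \sum_{t=1}^{T} \langle\theta_{t-1} - \theta, \Delta_t(\theta_{t-1})\rangle
+ \sum_{t=1}^{T} \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}},$$

where $\Delta_t(\theta_{t-1}) = g_t(\theta_{t-1}) - \nabla R(\theta_{t-1})$.

Proof. Let

$$\mathcal{R}(\theta) = \begin{cases} \log(M) + \sum_{j=1}^{M} \theta_j \log(\theta_j) & \text{if } \theta \in \Delta_M \\
\infty & \text{otherwise.} \end{cases} \tag{5}$$

$\mathcal{R}(\theta)$ is the normalized entropy function that has been widely used in the mirror descent algorithm when
optimizing over a simplex. The $\beta$-Fenchel dual of $\mathcal{R}$ is

$$\mathcal{R}^*_\beta(z) = \beta \log\left(\frac{1}{M}\sum_{j=1}^{M} e^{-z_j/\beta}\right). \tag{6}$$

Now, since $\mathcal{R}^*_\beta$ is continuously differentiable, we get

$$\mathcal{R}^*_\beta(\tilde z) = \mathcal{R}^*_\beta(z)
+ \int_0^1 \langle\tilde z - z, \nabla\mathcal{R}^*_\beta(\tau\tilde z + (1-\tau) z)\rangle\, d\tau. \tag{7}$$

Since $\mathcal{R}$ is 1-strongly convex w.r.t. the $\|\cdot\|_1$ norm, the $\beta$-Fenchel dual of $\mathcal{R}$ is
continuously differentiable and satisfies, for each pair $z, \tilde z$ in the domain of $\mathcal{R}^*_\beta$,

$$\|\nabla\mathcal{R}^*_\beta(z) - \nabla\mathcal{R}^*_\beta(\tilde z)\|_1 \le \frac{1}{\beta}\|z - \tilde z\|_\infty.$$

By definition, $\xi_t = \xi_{t-1} + g_t(\theta_{t-1})$. Substituting this in Equation 7 we get

$$\begin{aligned}
\mathcal{R}^*_{\beta_{t-1}}(\xi_t)
&= \mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1})
 + \int_0^1 \langle g_t(\theta_{t-1}), \nabla\mathcal{R}^*_{\beta_{t-1}}(\tau\xi_t + (1-\tau)\xi_{t-1})\rangle\, d\tau \qquad (8) \\
&= \mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1})
 + \int_0^1 \langle g_t(\theta_{t-1}), \nabla\mathcal{R}^*_{\beta_{t-1}}(\tau\xi_t + (1-\tau)\xi_{t-1})
 - \nabla\mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1})\rangle\, d\tau \\
&\quad + \int_0^1 \langle g_t(\theta_{t-1}), \nabla\mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1})\rangle\, d\tau \\
&\le \mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1}) + \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}}
 + \langle g_t(\theta_{t-1}), \nabla\mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1})\rangle, \qquad (9)
\end{aligned}$$

where in the last step we applied Hölder's inequality. By design, our sequence $\beta_0, \beta_1, \ldots$ is an increasing
sequence. Given $z$, $\mathcal{R}^*_\beta(z)$ is a non-increasing function of $\beta$. Hence, by Equation 9 we get

$$\mathcal{R}^*_{\beta_t}(\xi_t) \le \mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1})
+ \langle g_t(\theta_{t-1}), \nabla\mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1})\rangle
+ \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}}. \tag{10}$$

Summing over $t \ge 1$ and telescoping the sum, we get

$$\mathcal{R}^*_{\beta_T}(\xi_T) - \mathcal{R}^*_{\beta_0}(\xi_0)
\le \sum_{t=1}^{T} \langle g_t(\theta_{t-1}), \nabla\mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1})\rangle
+ \sum_{t=1}^{T} \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}}. \tag{11}$$

Since, by definition of the mirror descent procedure, $\nabla\mathcal{R}^*_{\beta_{t-1}}(\xi_{t-1}) = -\theta_{t-1}$, we get

$$\mathcal{R}^*_{\beta_T}(\xi_T) - \mathcal{R}^*_{\beta_0}(\xi_0)
\le -\sum_{t=1}^{T} \langle g_t(\theta_{t-1}), \theta_{t-1}\rangle
+ \sum_{t=1}^{T} \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}}. \tag{12}$$

Rearranging, and using the facts that $\xi_0 = 0$ and $\xi_T = \sum_{t=1}^{T} g_t(\theta_{t-1})$, we get that for any
$\theta \in \Delta_M$,

$$\sum_{t=1}^{T} \langle\theta_{t-1} - \theta, g_t(\theta_{t-1})\rangle
\le \mathcal{R}^*_{\beta_0}(0) - \mathcal{R}^*_{\beta_T}(\xi_T) - \langle\xi_T, \theta\rangle
+ \sum_{t=1}^{T} \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}}. \tag{13}$$

Let $\Delta_t(\theta_{t-1}) = g_t(\theta_{t-1}) - \nabla R(\theta_{t-1})$. By definition, $\mathbb{E}\Delta_t(\theta_{t-1}) = 0$.
Substituting $g_t(\theta_{t-1}) = \nabla R(\theta_{t-1}) + \Delta_t(\theta_{t-1})$ in Equation 13 we get

$$\sum_{t=1}^{T} \langle\theta_{t-1} - \theta, \nabla R(\theta_{t-1}) + \Delta_t(\theta_{t-1})\rangle
\le \mathcal{R}^*_{\beta_0}(0) - \mathcal{R}^*_{\beta_T}(\xi_T) - \langle\xi_T, \theta\rangle
+ \sum_{t=1}^{T} \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}}. \tag{14}$$

Since $\mathcal{R}^*_\beta$ is the $\beta$-Fenchel dual of $\mathcal{R}$, and $\xi_T \in E^*$, i.e.

$$\beta\mathcal{R}(\theta) = \sup_{z \in E^*} \left[ -z^T\theta - \mathcal{R}^*_\beta(z) \right], \tag{15}$$

this means that

$$-\langle\theta, \xi_T\rangle - \mathcal{R}^*_{\beta_T}(\xi_T) \le \beta_T \mathcal{R}(\theta). \tag{16}$$

Putting together Equations 14 and 16, and noting that $\mathcal{R}^*_{\beta_0}(0) = 0$ by Equation 6, we get

$$\begin{aligned}
\sum_{t=1}^{T} \langle\theta_{t-1} - \theta, \nabla R(\theta_{t-1})\rangle
&\le \mathcal{R}^*_{\beta_0}(0) + \beta_T \mathcal{R}(\theta)
 - \sum_{t=1}^{T} \langle\theta_{t-1} - \theta, \Delta_t(\theta_{t-1})\rangle
 + \sum_{t=1}^{T} \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}} \\
&\le \beta_T \mathcal{R}(\theta)
 - \sum_{t=1}^{T} \langle\theta_{t-1} - \theta, \Delta_t(\theta_{t-1})\rangle
 + \sum_{t=1}^{T} \frac{\|g_t(\theta_{t-1})\|_\infty^2}{2\beta_{t-1}}.
\end{aligned}$$

This completes our proof. □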